git-annex (10.20250116) UNRELEASED; urgency=medium
+ * Added the compute special remote.
+ * addcomputed: New command, adds a file that is generated by a compute
+ special remote.
+ * recompute: New command, recomputes computed files.
* Support help.autocorrect settings "prompt", "never", and "immediate".
* Allow setting remote.foo.annex-tracking-branch to a branch name
that contains "/", as long as it's not a remote tracking branch.
+++ /dev/null
-* would be nice to have a way to see what computations are used by a
- compute remote for a file. Put it in `whereis` output? But it's not an
- url. Maybe a separate command? That would also allow querying for eg,
- what files are inputs for another file. Or it could be exposed in the
- Remote interface, and made into a file matching option.
-
-* "getting input from <file>" message uses the original filename,
- but that file might have been renamed. Would be more clear to use
- whatever file in the tree currently points to the key it's getting
- (what if there is not one?)
-
-* allow git-annex enableremote with program= explicitly specified,
- without checking annex.security.allowed-compute-programs
-
-* addcomputed should honor annex.addunlocked.
-
- What about recompute? It seems it should either write the new version of
- the file as an unlocked file when the old version was unlocked, or also
- honor annex.addunlocked.
-
- Problem: Since recompute does not stage the file, it would have to write
- the content to the working tree. And then the user would need to
- git-annex add. But then, if the key was a VURL key, it would add it with
- the default backend instead, and the file would no longer use a computed
- key.
-
- So it seems that, for this to be done, recompute would need to stage the
- pointer file.
-
-* compute on files in submodules
-
-* recompute could ingest keys for other files than the one being
- recomputed, and remember them. Then recomputing those files could just
- use those keys, without re-running a computation. (Better than --others
- which got removed.)
-
-* `git-annex recompute foo bar baz`, when foo depends on bar which depends
- on baz, and when baz has changed, will not recompute foo, because bar has
- not changed. It then recomputes bar. So running the command again is
- needed to recompute foo.
-
- What it could do is, after it recomputes bar, notice that it already
- considered foo, and revisit foo, and recompute it then. It could either
- use a bloom filter to remember the files it considered but did not
- compute, or it could just notice that the command line includes foo
- (or includes a directory that contains foo), and then foo is not
- modified.
-
- Or it could build a DAG and traverse it, but building a DAG of a large
- directory tree has its own problems.
-
-* Should addcomputed honor annex.smallfiles? That would seem to imply
- that recompute should also support recomputing non-annexed files.
- Otherwise, adding a file and then recomputing it would vary in
- what the content of the file is, depending on annex.smallfiles setting.
--- /dev/null
+[[!comment format=mdwn
+ username="joey"
+ subject="""Re: DataLad exploration of the compute on demand space"""
+ date="2025-03-06T17:39:04Z"
+ content="""
+Thanks for explaining the design points of datalad-remake. Some
+different design choices than I have made, but mostly they strike me as
+implementing what is easier/possible from outside git-annex.
+
+Eg, storing the compute inputs under `.datalad` in the branch is fine --
+and might even be useful if you want to make a branch that changes
+something in there -- but of course in the git-annex implementation it
+stores the equivalent thing in the git-annex branch.
+
+I do hope I'm not closing off the design space from such differences
+by dropping a compute special remote right into git-annex. But I also
+expect that having a standard and easy way for at least simple
+computations will lead to a lot of contributions as others use it.
+
+Your fMRI case seems like one that my compute remote could handle well
+and easily.
+"""]]
--- /dev/null
+[[!comment format=mdwn
+ username="joey"
+ subject="""comment 22"""
+ date="2025-03-06T17:54:50Z"
+ content="""
+I've merged the compute special remote now.
+See [[special_remotes/compute]], [[git-annex-addcomputed]]
+and [[git-annex-recompute]].
+
+I have opened [[todo/compute_special_remote_remaining_todos]] with
+various ways that I want to improve it further, notably
+computing on inputs from submodules, which is not currently supported at
+all.
+
+----
+
+Here I'll go down mih's original and quite useful design criteria and see
+how the compute special remote applies to them:
+
+### Generate annex keys (that have never existed)
+
+`git-annex addcomputed --fast`
+
+### Re-generate annex keys
+
+`git-annex addcomputed`, optionally with the `--reproducible` option,
+followed by a later `git-annex get`
+
+Another thing that fits under this heading is when one of the original
+input files has gotten modified, and you want to compute a new version of
+the output file from it, using the same method as was used to compute it
+before. That's `git-annex recompute $output_file`
+
+### Worktree provisioning?
+
+This is the main thing I didn't implement. Given that git-annex is working
+with large files and needs to support various filesystems and OSes that
+lack hardlinks and softlinks, it's hard to do this inexpensively.
+
+Also, it turned out to make sense for the compute program to request
+the input files it needs, since this lets git-annex learn what the input
+files are, so it can make them available when regenerating a computed file
+later. And so the protocol just has git-annex respond with the path to
+the content of the file.
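
As a rough illustration of the request/response idea described above, here is a toy model in Python. It is only a sketch: `ComputeHost`, `request_input`, and `compute_upper` are hypothetical names made up for this example, and it does not reproduce the actual messages exchanged between git-annex and a compute program. It only shows why having the program ask for its inputs lets the host learn the input list as a side effect.

```python
# Toy model only: the class and function names are hypothetical,
# not git-annex's actual compute protocol.
import os
import tempfile


class ComputeHost:
    # Stands in for git-annex: maps working tree names to content paths.
    def __init__(self, content_paths):
        self.content_paths = dict(content_paths)
        self.requested_inputs = []  # learned while the computation runs

    def request_input(self, name):
        # Record the request so the same inputs can be provided again
        # when the output is regenerated later, then answer with the
        # path to the file's content.
        if name not in self.requested_inputs:
            self.requested_inputs.append(name)
        return self.content_paths[name]


def compute_upper(host, input_name):
    # Stands in for a compute program: it asks for the input it needs
    # rather than having a whole worktree provisioned for it.
    with open(host.request_input(input_name)) as f:
        return f.read().upper()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "content-of-foo")
    with open(path, "w") as f:
        f.write("hello")
    host = ComputeHost({"foo.txt": path})
    result = compute_upper(host, "foo.txt")
    print(result, host.requested_inputs)
```

After the run, `host.requested_inputs` holds exactly the files the computation asked for, which is the information needed to make them available again later.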
+
+### Request one key, receive many
+
+This is supported. (So is using multiple inputs to produce one (or more)
+outputs.)
+
+### Instruction deposition
+
+`git-annex addcomputed`
+
+### Storage redundancy tests
+
+It did make sense to have it automatically `git-annex get` the inputs.
+Well, I think it makes sense in most cases; this may become a tunable
+setting of the compute special remote.
+
+### Trust
+
+Handled by requiring that the user install a `git-annex-compute-foo` command
+in PATH, and provide the name of the command to `initremote`.
+
+And for later `enableremote` or `autoenable=true`, it will only
+allow programs that are listed in the
+`annex.security.allowed-compute-programs` git config.
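
The allow-list check can be pictured with a small Python sketch. This is an illustration under stated assumptions, not the git-annex implementation: `program_allowed` is a hypothetical helper, and the example assumes the config value is a whitespace-separated list of program names, which the text above does not specify.

```python
def program_allowed(program, allowed_config):
    # ASSUMPTION: the config value is a whitespace-separated list of
    # program names. That format is a guess made for this sketch.
    allowed = allowed_config.split()
    return program in allowed


config_value = "git-annex-compute-foo git-annex-compute-bar"
print(program_allowed("git-annex-compute-foo", config_value))
print(program_allowed("git-annex-compute-other", config_value))
```

The point of the design is that a remote's stored configuration cannot by itself cause an arbitrary program to run; the name has to appear in configuration the local user controls.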
+"""]]
--- /dev/null
+This is the remainder of my todo list while I was building the
+compute special remote. --[[Joey]]
+
+* write a tip showing how to use this
+
+* Write some simple compute programs so we have something to start with.
+
+ - convert between image formats, eg jpeg to png
+ - run a command in a singularity container (that is one of the inputs)
+ - run a wasm binary (that is one of the inputs)
+
+* compute on input files in submodules
+
+* annex.diskreserve can be violated if getting a file computes it but also
+ some other output files, which get added to the annex.
+
+* would be nice to have a way to see what computations are used by a
+ compute remote for a file. Put it in `whereis` output? But it's not an
+ url. Maybe a separate command? That would also allow querying for eg,
+ what files are inputs for another file. Or it could be exposed in the
+ Remote interface, and made into a file matching option.
+
+* "getting input from <file>" message uses the original filename,
+ but that file might have been renamed. Would be more clear to use
+ whatever file in the tree currently points to the key it's getting
+ (what if there is not one?)
+
+* allow git-annex enableremote with program= explicitly specified,
+ without checking annex.security.allowed-compute-programs
+
+* addcomputed should honor annex.addunlocked.
+
+ What about recompute? It seems it should either write the new version of
+ the file as an unlocked file when the old version was unlocked, or also
+ honor annex.addunlocked.
+
+ Problem: Since recompute does not stage the file, it would have to write
+ the content to the working tree. And then the user would need to
+ git-annex add. But then, if the key was a VURL key, it would add it with
+ the default backend instead, and the file would no longer use a computed
+ key.
+
+ So it seems that, for this to be done, recompute would need to stage the
+ pointer file.
+
+* recompute could ingest keys for other files than the one being
+ recomputed, and remember them. Then recomputing those files could just
+ use those keys, without re-running a computation. (Better than --others
+ which got removed.)
+
+* `git-annex recompute foo bar baz`, when foo depends on bar which depends
+ on baz, and when baz has changed, will not recompute foo, because bar has
+ not changed. It then recomputes bar. So running the command again is
+ needed to recompute foo.
+
+ What it could do is, after it recomputes bar, notice that it already
+ considered foo, and revisit foo, and recompute it then. It could either
+ use a bloom filter to remember the files it considered but did not
+ compute, or it could just notice that the command line includes foo
+ (or includes a directory that contains foo), and then foo is not
+ modified.
+
+ Or it could build a DAG and traverse it, but building a DAG of a large
+ directory tree has its own problems.
+
+* Should addcomputed honor annex.smallfiles? That would seem to imply
+ that recompute should also support recomputing non-annexed files.
+ Otherwise, adding a file and then recomputing it would vary in
+ what the content of the file is, depending on annex.smallfiles setting.
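
To make the `recompute foo bar baz` ordering item above concrete, here is a minimal Python sketch of the DAG option. It is not how git-annex works today, and `recompute_order` and the `deps` mapping are hypothetical names for this example. It assumes the dependency graph is small enough to hold in memory, which is exactly the concern raised above for large directory trees.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def recompute_order(inputs_of, changed):
    # inputs_of maps each computed file to the files it was computed from.
    # static_order() yields a file only after all of its inputs, so a
    # stale input is always seen before the files that depend on it.
    stale = set(changed)
    order = []
    for f in TopologicalSorter(inputs_of).static_order():
        if any(i in stale for i in inputs_of.get(f, ())):
            stale.add(f)
            order.append(f)
    return order


# foo depends on bar, which depends on baz; baz has changed.
deps = {"foo": ["bar"], "bar": ["baz"]}
print(recompute_order(deps, {"baz"}))  # prints ['bar', 'foo']
```

Visiting files in dependency order means bar is recomputed before foo is considered, so a single pass is enough and the command does not need to be run a second time.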